
Collaborating Authors

 Aadirupa Saha


DP-Dueling: Learning from Preference Feedback without Compromising User Privacy

arXiv.org Artificial Intelligence

Research indicates that gathering feedback in relative form is often more convenient, faster, and more cost-effective than collecting absolute ratings [31, 40]. For example, when assessing an individual's preference between two items A and B, respondents usually find it easier to answer a preference query such as "Which item do you prefer, A or B?" than to rate A and B on a scale from 0 to 10. From a system designer's perspective, leveraging this preference data can significantly enhance system performance, especially when the data can be collected in a relative and online fashion. This applies to many real-world scenarios in which human preferences are gathered online, including recommendation systems, crowd-sourcing platforms, training bots, multiplayer games, search engine optimization, online retail, survey design, expert reviews, product selection, and even broader reinforcement learning problems with complex reward structures. Because of its broad utility and the simplicity of gathering relative feedback, learning from preferences has become highly popular in the machine learning community, where it has been studied extensively over the past decade under the name "Dueling-Bandits" (DB). This framework extends the traditional multi-armed bandit (MAB) setting described in [4]. In the DB framework, the goal is to identify a set of 'good' options from a fixed decision


One Arrow, Two Kills: An Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits

arXiv.org Artificial Intelligence

We address the problem of \emph{`Internal Regret'} in \emph{Sleeping Bandits} in the fully adversarial setup, draw connections between the existing notions of sleeping regret in the multi-armed bandits (MAB) literature, and analyze their implications. Our first contribution is to propose the new notion of \emph{Internal Regret} for sleeping MAB. We then propose an algorithm that yields sublinear regret in that measure, even for a completely adversarial sequence of losses and availabilities. We further show that a low sleeping internal regret always implies a low external regret, as well as a low policy regret for i.i.d. sequences of losses. The main contribution of this work lies in unifying the different existing notions of regret in sleeping bandits and understanding how each implies the others. Finally, we extend our results to the setting of \emph{Dueling Bandits} (DB), a preference-feedback variant of MAB, and propose a reduction to MAB to design a low-regret algorithm for sleeping dueling bandits with stochastic preferences and adversarial availabilities. The efficacy of our algorithms is justified through empirical evaluations.
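The interaction protocol behind sleeping dueling bandits can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper: the learner below is a uniform-random placeholder, and the preference matrix `PREF`, the availability sequence `AVAILABILITIES`, and the helpers `duel` and `run_protocol` are hypothetical names and values chosen only for illustration. The sketch assumes that in each round only a subset of arms is available, the learner picks two available arms, and a stochastic preference decides the winner.

```python
import random


def duel(pref, i, j, rng):
    """Return the winner of a duel between arms i and j.

    pref[i][j] is the assumed probability that arm i beats arm j.
    """
    return i if rng.random() < pref[i][j] else j


def run_protocol(pref, availabilities, seed=0):
    """Play one duel per round using only the arms available in that round."""
    rng = random.Random(seed)
    history = []
    for available in availabilities:
        arms = sorted(available)
        # Placeholder learner: pick both arms uniformly at random
        # (with replacement) from the currently available set.
        i = rng.choice(arms)
        j = rng.choice(arms)
        winner = duel(pref, i, j, rng)
        history.append((i, j, winner))
    return history


# Hypothetical 3-arm preference matrix with pref[i][j] + pref[j][i] == 1.
PREF = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

# Hypothetical availability sequence; in the adversarial setting these
# sets could be chosen arbitrarily by an adversary.
AVAILABILITIES = [{0, 1, 2}, {1, 2}, {0, 2}, {2}]

history = run_protocol(PREF, AVAILABILITIES)
for round_index, (i, j, winner) in enumerate(history):
    print(f"round {round_index}: dueled {i} vs {j}, winner {winner}")
```

A real learner would replace the uniform choice with a rule that uses the observed winners, while still restricting itself to the arms available in each round.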